ggplot2 I

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • The layer system
  • Aesthetics
  • Geometries

Warm up

R learning curve

UPF Inclusive Growth Index (IGI)

igi <- rendacs |>
  mutate(zs_gdp = (import_euros - mean(import_euros)) / sd(import_euros),
         zs_gini = (mean(index_gini) - index_gini) / sd(index_gini),
         upf_index = (zs_gdp + zs_gini) / 2) |> 
  select(nom_barri, secc = seccio_censal, zs_gdp:upf_index) |> 
  arrange(desc(upf_index))
head(igi, 10)
# A tibble: 10 × 5
   nom_barri                      secc  zs_gdp zs_gini upf_index
   <chr>                         <dbl>   <dbl>   <dbl>     <dbl>
 1 les Corts                        24  3.59   -0.492       1.55
 2 Sant Andreu                      28  0.718   2.28        1.50
 3 la Salut                         23  1.50    1.09        1.29
 4 les Corts                        34  1.37    1.19        1.28
 5 la Maternitat i Sant Ramon       46  1.56    0.959       1.26
 6 la Vila Olímpica del Poblenou    53  2.47    0.0426      1.26
 7 Sant Gervasi- la Bonanova        39  2.84   -0.416       1.21
 8 Provençals del Poblenou         104  0.204   2.21        1.21
 9 Canyelles                        63 -0.0334  2.44        1.20
10 les Tres Torres                  28  3.80   -1.46        1.17

UPF Inclusive Growth Index (IGI)

tail(igi, 10)
# A tibble: 10 × 5
   nom_barri                              secc zs_gdp zs_gini upf_index
   <chr>                                 <dbl>  <dbl>   <dbl>     <dbl>
 1 Sant Pere, Santa Caterina i la Ribera    51 -1.15    -1.54     -1.34
 2 el Raval                                  3 -1.27    -1.43     -1.35
 3 el Barri Gòtic                           30 -0.989   -1.74     -1.36
 4 Sant Pere, Santa Caterina i la Ribera    48 -1.36    -1.38     -1.37
 5 Sant Pere, Santa Caterina i la Ribera    47 -1.04    -1.71     -1.38
 6 el Putxet i el Farró                     81 -1.69    -1.15     -1.42
 7 el Raval                                  6 -1.17    -2.02     -1.59
 8 el Barri Gòtic                           25 -0.924   -2.43     -1.68
 9 el Barri Gòtic                           29 -1.30    -2.27     -1.79
10 el Barri Gòtic                           31 -1.08    -3.04     -2.06
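The index averages two z-scores, and the Gini score is sign-flipped (mean minus value) so that lower inequality counts as better. A minimal sketch on invented numbers, assuming a hypothetical three-row tibble `toy` that only borrows its column names from `rendacs`:

```r
library(dplyr)

# Hypothetical toy data: three census sections with invented values
toy <- tibble(
  import_euros = c(10000, 20000, 30000),
  index_gini   = c(40, 30, 35)
)

toy_igi <- toy |>
  mutate(zs_gdp = (import_euros - mean(import_euros)) / sd(import_euros),
         # mean minus value: below-average inequality gives a positive score
         zs_gini = (mean(index_gini) - index_gini) / sd(index_gini),
         upf_index = (zs_gdp + zs_gini) / 2)

toy_igi
```

With these values the first section is below average on both components, while the other two tie: one has average income but the lowest inequality, the other has the highest income but average inequality.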

Made with ggplot

The layer system

Basic layers

Almost always, a ggplot consists of three layers:

    1. Dataframe
    2. Aesthetics
    3. Geometry

df |>
  ggplot(aes(aesthetics)) +
  geometry()
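The template can be filled in with any data frame. A minimal sketch using `mpg`, a dataset bundled with ggplot2 (an assumption made here so the example runs without the course data):

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the course data here
p <- mpg |>                              # 1. data frame
  ggplot(aes(x = displ, y = hwy)) +      # 2. aesthetics
  geom_point()                           # 3. geometry

p
```

Each row of the data frame becomes one point, positioned by the two variables named in aes().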

Optional layers

Optionally, we add more layers, such as:

    1. Facet
    2. Coordinates
    3. Scale
    4. Theme
    5. Etc.
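As a small illustration of how these optional layers stack on top of the three basic ones, here is a sketch on the bundled `mpg` data (the dataset and the particular palette, limits and theme are illustrative choices, not part of the course material):

```r
library(ggplot2)

p <- mpg |>
  ggplot(aes(x = displ, y = hwy, col = drv)) +
  geom_point() +
  facet_wrap(~year) +                       # facet: one panel per year
  scale_color_brewer(palette = "Dark2") +   # scale: how drv maps to colors
  coord_cartesian(ylim = c(10, 45)) +       # coordinates: visible y range
  theme_minimal()                           # theme: non-data appearance

p
```

Every optional layer is added with `+`, in the same way as the geometry.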

Layered example

Example of a fully equipped plot.

bins |> 
  ggplot(aes(x = pvote, y = n, fill = type)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  # linewidth replaces size for lines since ggplot2 3.4.0
  geom_hline(yintercept = 0, linewidth = 0.3) +
  scale_fill_manual(values = c("grey65", "grey35")) +
  # facet_share() comes from the ggpol package, not from ggplot2
  facet_share(~type, dir = "h", scales = "free", reverse_num = TRUE) +
  coord_flip() +
  labs(x = NULL, fill = NULL, y = "Vote") +
  theme(panel.background = element_blank(),
        strip.background = element_blank(),
        strip.text = element_text(size = 16),
        text = element_text(size = 15),
        axis.line.x = element_line(linewidth = 0.3),
        axis.title.x = element_text(vjust=125, size = 14))

Aesthetics

Cartesian coordinates

  • x and y.

Aesthetics

Inside aes() goes everything that is represented by a variable:

rendacs |> 
  ggplot(aes(x = import_euros, y = index_gini, col = nom_districte)) +
  geom_point()

  • x: variable on the horizontal axis.
  • y: variable on the vertical axis.
  • col: variable that sets the geometry color.

Aesthetics vs. attributes

Color as aesthetic

rendacs |> 
  ggplot(aes(x = import_euros, 
             y = index_gini, 
             col = nom_districte)) +
  geom_point()

Color as attribute

rendacs |> 
  ggplot(aes(x = import_euros, 
             y = index_gini)) +
  geom_point(col = "red")

Aesthetics vs. attributes

Aesthetics represent a variable and always go inside the aes() function. E.g.:

  • x = gdp
  • col = continent

Attributes represent fixed characteristics of the geometry. They go outside aes(), normally in the geom_xxx() function. E.g.:

  • col = "red"
  • size = 2
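A common slip is to put a fixed value such as "red" inside aes(). The sketch below, on the bundled `mpg` data (an assumption so that it runs without the course files), compares both placements by inspecting the colors that ggplot2 actually assigns to the points:

```r
library(ggplot2)

# Attribute: every point is drawn in red
p_attr <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(col = "red")

# Aesthetic: "red" is treated as a variable with a single value, so
# ggplot2 picks a color from its default scale and adds a legend
p_aes <- ggplot(mpg, aes(x = displ, y = hwy, col = "red")) +
  geom_point()

unique(layer_data(p_attr)$colour)   # the literal color that was requested
unique(layer_data(p_aes)$colour)    # a color chosen by the scale instead
```

If a plot shows an unexpected legend with a single entry, a constant inside aes() is a likely cause.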

Aesthetics vs. attributes

Types of attributes:

  • size: size of the geometry.
  • alpha: transparency.
  • text: names.
  • label: text to print, e.g. names.
  • fill: For bars, polygons, and things to be painted.
  • shape: Mostly for points.
  • linetype / lty: For lines.

Geometries

  • geom_bar()
  • geom_col()
  • geom_point()
  • geom_boxplot()
  • geom_smooth()
  • … about 35 geometries!

One categoric variable

Bar plot (I)

Counts, etc.: geom_bar()

accidents |> 
  ggplot(aes(x = nom_districte)) +
  geom_bar()

One numeric variable

Histogram

Number of bins: geom_histogram()

accidents |> 
  ggplot(aes(x = hora_dia)) +
  geom_histogram(bins = 24)

Density plot

A lot of numeric data: geom_density()

accidents |> 
  ggplot(aes(x = hora_dia)) +
  geom_density()

Dot plot

Semi-continuous, few cases (~100): geom_dotplot()

accidents |> 
  filter(nom_barri == "el Camp de l'Arpa del Clot") |> 
  ggplot(aes(x = hora_dia)) +
  geom_dotplot()

Two categoric variables

Bar plot (II)

Bars are ‘filled’ with a categorical variable: geom_bar()

accidents |>
  ggplot(aes(x = nom_districte, fill = descripcio_torn)) +
  geom_bar(position = "fill")
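With position = "fill", each bar is rescaled to height 1, so the segments show shares rather than counts. A sketch of the same numbers computed by hand, using the bundled `mpg` data in place of `accidents` (an assumption for runnability):

```r
library(ggplot2)
library(dplyr)

# Share of each drive type within each vehicle class
shares <- mpg |>
  count(class, drv) |>
  group_by(class) |>
  mutate(share = n / sum(n)) |>
  ungroup()

shares

# The same shares drawn as stacked bars of height 1
p <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

p
```

Because every bar has the same height, this view compares composition across categories but hides how many cases each category has.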

One categoric, one numeric

Bar plot (III)

A numeric variable (not a count) on the y axis: geom_col()

municipi |> 
  ggplot(aes(x = comarca, y = superficie_km2)) +
  geom_col()
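The difference between the two bar geometries is who does the counting: geom_bar() counts rows itself, while geom_col() draws the values it is given. A sketch on the bundled `mpg` data (an assumption, since the course datasets are not available here) that produces the same bars both ways:

```r
library(ggplot2)
library(dplyr)

# geom_bar(): ggplot2 counts the rows per class
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col(): the counts are computed first and passed as y
p_col <- mpg |>
  count(class) |>
  ggplot(aes(x = class, y = n)) +
  geom_col()

# Bar heights are identical in both plots
as.numeric(layer_data(p_bar)$y)
as.numeric(layer_data(p_col)$y)
```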

Boxplot (I)

Displays the median and IQR: geom_boxplot()

accidents |> 
  ggplot(aes(y = edat)) +
  geom_boxplot()
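To check what the box shows, the statistics can be computed by hand and compared with the values ggplot2 uses. A sketch on the bundled `mpg` data, assuming default geom_boxplot() settings:

```r
library(ggplot2)

p <- ggplot(mpg, aes(y = hwy)) +
  geom_boxplot()

# Statistics computed by hand
median(mpg$hwy)
quantile(mpg$hwy, c(0.25, 0.75))

# Statistics ggplot2 computed for the box: lower edge, middle line, upper edge
box <- layer_data(p)[, c("lower", "middle", "upper")]
box
```

The height of the box (upper minus lower) is the IQR.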

Boxplot (II)

elecc19 |>
  filter(nombre_de_comunidad == "Andalucía") |> 
  mutate(cs_per = cs / poblacion * 100) |> 
  ggplot(aes(x = nombre_de_provincia, y = cs_per)) +
  geom_boxplot()

Violin plot (I)

Similar to density plot: geom_violin()

elecc19 |>
  filter(nombre_de_comunidad == "Andalucía") |> 
  mutate(cs_per = cs / poblacion * 100) |> 
  ggplot(aes(x = fct_reorder(nombre_de_provincia, cs_per), y = cs_per)) +
  geom_violin() +
  coord_flip()

Violin plot (II)

rendacs |> 
  ggplot(aes(x = fct_reorder(nom_districte, import_euros), y = import_euros)) +
  geom_violin() +
  coord_flip()

Points and text

For percentages: geom_point() + geom_text()

lloguer_any |> 
  group_by(nom_districte) |>
  summarize(preu_m2 = round(mean(preu_m2, na.rm = TRUE), 1)) |> 
  ggplot(aes(x = fct_reorder(nom_districte, preu_m2), y = preu_m2)) +
  geom_point(size = 10, col = "darkblue") +
  geom_text(aes(label = preu_m2), col = "white") +
  coord_flip() +
  theme_minimal() +
  labs(x = NULL, y = NULL)

Numerics in time

Line plot (I)

A timeline: geom_line()

lloguer_any |> 
  group_by(any) |> 
  summarize(preu = mean(preu, na.rm = TRUE)) |> 
  ggplot(aes(x = any, y = preu)) +
  geom_line()

Line plot (II)

lloguer_any |> 
  group_by(any, nom_districte) |> 
  summarize(preu = mean(preu, na.rm = TRUE)) |> 
  ggplot(aes(x = any, y = preu, col = nom_districte)) +
  geom_line()
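Mapping a categorical variable to col makes geom_line() draw one line per category. A sketch of the same summarize-then-plot pattern on the bundled `mpg` data (an assumption; it has only two years, so each line has two points):

```r
library(ggplot2)
library(dplyr)

yearly <- mpg |>
  group_by(year, drv) |>
  # .groups = "drop" returns an ungrouped tibble and silences the
  # regrouping message that summarize() prints with several groups
  summarize(hwy = mean(hwy, na.rm = TRUE), .groups = "drop")

p <- ggplot(yearly, aes(x = year, y = hwy, col = drv)) +
  geom_line()

p
```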

Line plot: Example 1

All the code can be found here

Line plot: Example 2

Article here | Code here

Path (I)

Two numerics across time: geom_path()

lloguer_any |> 
  group_by(any, nom_districte) |> 
  summarize(preu = mean(preu, na.rm = TRUE),
            preu_m2 = mean(preu_m2, na.rm = TRUE)) |> 
  ggplot(aes(x = preu_m2, y = preu, col = nom_districte,
             alpha = any)) +
  # linewidth replaces size for lines since ggplot2 3.4.0
  geom_path(linewidth = 1.2)

Path. Example

The code can be found here

Two numerics

Points (I)

Two numeric variables: geom_point()

municipi |> 
  ggplot(aes(x = altitud_m, y = superficie_km2)) +
  geom_point()

Points (II)

accidents |> 
  filter(longitud > 0) |> 
  ggplot(aes(x = longitud, y = latitud)) +
  geom_point() 

Jitter

For discrete numeric variables or overplotting: geom_jitter()

accidents |>
  filter(edat > 0.1) |> 
  ggplot(aes(x = edat, y = hora_dia)) +
  geom_point(position = position_jitter(width = 0.3),
             alpha = 0.3)
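geom_jitter() is a shortcut for geom_point(position = "jitter"), so the plot above can also be written with it. A sketch on the bundled `mpg` data (an assumption for runnability), where height = 0 is an illustrative choice that keeps the y values exact and only spreads the points horizontally:

```r
library(ggplot2)

set.seed(1)  # jitter is random; a seed makes the plot reproducible

p <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.3)

p
```

Each point moves by at most `width` to either side of its true x value, which separates points that would otherwise be drawn on top of each other.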

Text

Abbreviations: geom_text() or geom_label()

festivals |> 
  group_by(ambit) |> 
  summarize(bcn = mean(assistents_a_barcelona, na.rm = TRUE),
            fora = mean(assistents_fora_de_barcelona, na.rm = TRUE)) |> 
  ggplot(aes(x = bcn, y = fora)) +
  geom_point() +
  geom_text(aes(label = ambit), position = position_nudge(x = 200,
                                                          y = 100)) +
  xlim(4000, 19000)

Three variables

Tile

Two categoric, one numeric: geom_tile()

accidents |> 
  count(hora_dia, descripcio_dia_setmana) |> 
  ggplot(aes(x = hora_dia, y = descripcio_dia_setmana, fill = n)) +
  geom_tile()

Descriptive statistics

Mean and median

Lines with info: geom_vline() or geom_hline()

municipi |> 
  ggplot(aes(x = superficie_km2)) +
  geom_density() +
  geom_vline(xintercept = mean(municipi$superficie_km2, na.rm = TRUE),
             col = "blue") +
  geom_vline(xintercept = median(municipi$superficie_km2, na.rm = TRUE),
             col = "green")
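The same idea can be written with the statistics computed beforehand, which keeps the plotting code shorter. A sketch on `diamonds`, a dataset bundled with ggplot2 (an assumption; the names `price_mean` and `price_median` are made up for this example):

```r
library(ggplot2)

# Compute the statistics once, before plotting
price_mean   <- mean(diamonds$price, na.rm = TRUE)
price_median <- median(diamonds$price, na.rm = TRUE)

p <- ggplot(diamonds, aes(x = price)) +
  geom_density() +
  geom_vline(xintercept = price_mean, col = "blue") +
  geom_vline(xintercept = price_median, col = "green")

p
```

In a right-skewed distribution such as this one, the mean (blue) lies to the right of the median (green).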